域的概括(DG)旨在在一个或多个不同但相关的源域上学习一个模型,这些模型可以推广到看不见的目标域。现有的DG方法试图提示模型的概括能力的源域的多样性,同时他们可能必须引入辅助网络或达到计算成本。相反,这项工作应用了特征空间中的隐式语义增强来捕获源域的多样性。具体来说,包括距离度量学习(DML)的附加损失函数,以优化数据分布的局部几何形状。此外,采用跨熵损失的逻辑被无限增强作为DML损失的输入特征,以代替深度特征。我们还提供了理论分析,以表明逻辑可以近似于原始特征上定义的距离。此外,我们对方法背后的机制和理性进行了深入的分析,这使我们可以更好地了解为什么要代替特征的杠杆逻辑可以帮助域的概括。拟议的DML损失与隐式增强作用纳入了最近的DG方法中,即傅立叶增强联合老师框架(FACT)。同时,我们的方法也可以轻松地插入各种DG方法中。对三个基准测试(Digits-DG,PAC和办公室家庭)进行的广泛实验表明,该建议的方法能够实现最新的性能。
translated by 谷歌翻译
As one of the prevalent methods to achieve automation systems, Imitation Learning (IL) presents a promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by the recent approaches in explainable artificial intelligence methods, we proposed a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from the randomized masked demonstrations and uses the conventional evaluation outcome environment returns as the coefficient to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The result shows that R2RISE successfully distinguishes important frames from the demonstrations.
translated by 谷歌翻译
With the rapid deployment of graph neural networks (GNNs) based techniques into a wide range of applications such as link prediction, node classification, and graph classification the explainability of GNNs has become an indispensable component for predictive and trustworthy decision-making. Thus, it is critical to explain why graph neural network (GNN) makes particular predictions for them to be believed in many applications. Some GNNs explainers have been proposed recently. However, they lack to generate accurate and real explanations. To mitigate these limitations, we propose GANExplainer, based on Generative Adversarial Network (GAN) architecture. GANExplainer is composed of a generator to create explanations and a discriminator to assist with the Generator development. We investigate the explanation accuracy of our models by comparing the performance of GANExplainer with other state-of-the-art methods. Our empirical results on synthetic datasets indicate that GANExplainer improves explanation accuracy by up to 35\% compared to its alternatives.
translated by 谷歌翻译
Video Super-Resolution (VSR) aims to restore high-resolution (HR) videos from low-resolution (LR) videos. Existing VSR techniques usually recover HR frames by extracting pertinent textures from nearby frames with known degradation processes. Despite significant progress, grand challenges are remained to effectively extract and transmit high-quality textures from high-degraded low-quality sequences, such as blur, additive noises, and compression artifacts. In this work, a novel Frequency-Transformer (FTVSR) is proposed for handling low-quality videos that carry out self-attention in a combined space-time-frequency domain. First, video frames are split into patches and each patch is transformed into spectral maps in which each channel represents a frequency band. It permits a fine-grained self-attention on each frequency band, so that real visual texture can be distinguished from artifacts. Second, a novel dual frequency attention (DFA) mechanism is proposed to capture the global frequency relations and local frequency relations, which can handle different complicated degradation processes in real-world scenarios. Third, we explore different self-attention schemes for video processing in the frequency domain and discover that a ``divided attention'' which conducts a joint space-frequency attention before applying temporal-frequency attention, leads to the best video enhancement quality. Extensive experiments on three widely-used VSR datasets show that FTVSR outperforms state-of-the-art methods on different low-quality videos with clear visual margins. Code and pre-trained models are available at https://github.com/researchmm/FTVSR.
translated by 谷歌翻译
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two-coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noises. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging over the two subnets, which enables efficient cross-modal alignment, and thus reinforces the audio-video fidelity for each other. Extensive experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
translated by 谷歌翻译
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. And as a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.
translated by 谷歌翻译
Applying suction grippers in unstructured environments is a challenging task because of depth and tilt errors in vision systems, requiring additional costs in elaborate sensing and control. To reduce additional costs, suction grippers with compliant bodies or mechanisms have been proposed; however, their bulkiness and limited allowable error hinder their use in complex environments with large errors. Here, we propose a compact suction gripper that can pick objects over a wide range of distances and tilt angles without elaborate sensing and control. The spring-inserted gripper body deploys and conforms to distant and tilted objects until the suction cup completely seals with the object and retracts immediately after, while holding the object. This seamless deployment and retraction is enabled by connecting the gripper body and suction cup to the same vacuum source, which couples the vacuum picking and retraction of the gripper body. Experimental results validated that the proposed gripper can pick objects within 79 mm, which is 1.4 times the initial length, and can pick objects with tilt angles up to 60{\deg}. The feasibility of the gripper was verified by demonstrations, including picking objects of different heights from the same picking height and the bin picking of transparent objects.
translated by 谷歌翻译
预先训练的图像文本模型(如剪辑)已经证明了从大规模的Web收集的图像文本数据中学到的视觉表示的强大力量。鉴于学习良好的视觉特征,一些现有的作品将图像表示转移到视频域并取得良好的结果。但是,如何利用图像语言预训练的模型(例如,剪辑)进行视频培训(后培训)仍在探索。在本文中,我们研究了两个问题:1)阻碍后期剪辑的因素是什么因素,以进一步提高视频语言任务的性能? 2)如何减轻这些因素的影响?通过一系列比较实验和分析,我们发现语言源之间的数据量表和域间隙具有很大的影响。由这些动机,我们提出了一种配备了视频代理机制的Omnisource跨模式学习方法,即剪辑,即剪辑VIP。广泛的结果表明,我们的方法可以提高视频检索的剪辑的性能。我们的模型还可以在包括MSR-VTT,DIDEMO,LSMDC和ActivityNet在内的各种数据集上实现SOTA结果。我们在https://github.com/microsoft/xpretrain/tree/main/main/main/clip-vip上发布了代码和预训练的剪辑模型。
translated by 谷歌翻译
AI Illustrator旨在自动设计具有视觉吸引力的图像,以激发丰富的思想和情感。为了实现这一目标,我们提出了一个框架,将具有复杂语义的原始描述转换为语义相应的图像。主要的挑战在于原始描述语义的复杂性,可能很难可视化(\ textit {e}。通常,它对现有方法构成了处理此类描述的挑战。为了解决这个问题,我们建议基于rompt \ textbf {c} ross- \ textbf {m} odal generation \ textbf {frame} work(pcm-frame)利用两个强大的预培养模型,,包括剪辑和Stylegan。我们的框架由两个组件组成:\ textIt {textIt嵌入} s到\ textit {image嵌入} s的投影模块,基于提示以及一个构建的适应图像生成模块,该模块构建了\ textit {image嵌入{image Embedding} s作为输入并受到共同语义一致性损失的训练。为了弥合现实图像和插图设计之间的差距,我们进一步采用了风格化模型作为后处理,以获得更好的视觉效果。受益于预先训练的模型,我们的方法可以处理复杂的描述,并且不需要外部配对数据进行培训。此外,我们已经建立了一个由200个原始描述组成的基准。我们进行了一项用户研究,以证明我们对复杂文本的竞争方法的优势。我们在https://github.com/researchmm/ai \ _illustrator} {https://github.com/researchmem/researchmm/ai \_illustrator上发布代码
translated by 谷歌翻译
图像增强旨在通过修饰颜色和音调来提高照片的美学视觉质量,并且是专业数字摄影的必不可少的技术。近年来,基于学习的图像增强算法已达到有希望的表现,并吸引了日益普及。但是,典型的努力试图为所有像素的颜色转换构建一个均匀的增强子。它忽略了对照片重要的不同内容(例如,天空,海洋等)之间的像素差异,从而导致结果不令人满意。在本文中,我们提出了一个新颖的可学习背景知觉的4维查找表(4D LUT),该表通过适应性地学习照片上下文来实现每个图像中不同内容的增强。特别是,我们首先引入一个轻量级上下文编码器和一个参数编码器,以分别学习像素级类别的上下文图和一组图像自适应系数。然后,通过通过系数集成多个基础4D LUT来生成上下文感知的4D LUT。最后,可以通过将源图像和上下文图馈入融合的上下文感知的4D〜LUT来获得增强的图像。与传统的3D LUT(即RGB映射到RGB)相比,通常用于摄像机成像管道系统或工具,4D LUT,即RGBC(RGB+上下文)映射到RGB,可实现具有不同像素的颜色转换的最佳控制每个图像中的内容,即使它们具有相同的RGB值。实验结果表明,我们的方法在广泛使用的基准中优于其他最先进的方法。
translated by 谷歌翻译